---
author: "Hanna-Leena"
title: "IODS course project"
output:
  html_document:
    theme: cosmo
    toc: true
    toc_depth: 2
    fig_caption: true
    fig_width: 6
    fig_height: 4
---


About the project

This course looks difficult for me, as I have never done anything like this before. I hope to learn the basics of the R programming language. I heard about the course from my colleagues. Here is a link to my GitHub repository: https://github.com/Hanna-Leena/IODS-project. This is an R Markdown (.Rmd) file, so R Markdown syntax can be used.


Regression and model validation

Description of the dataset

students2014<-read.csv("learning2014.csv", header = TRUE, sep= ",")

# Looking at the structure

str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(students2014)
##   gender Age Attitude     deep  stra     surf Points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21

The data contains 166 observations (one per student) and 7 variables: gender, age, attitude, deep, stra, surf and points. Attitude measures the student's global attitude toward statistics, Points are the exam points, and deep, stra and surf measure the student's scores on different approaches to learning (deep, strategic and surface learning).

Show a graphical overview of the data and summaries of the variables in the data

pairs(students2014[-1], col=students2014$gender)

summary(students2014)
##  gender       Age           Attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

This plot is not actually helping me understand what kind of data I am processing. I like the summary table, and I actually use it quite often when getting familiar with my data. Let's look at it more closely.


There are 110 females and 56 males. The minimum age is 17, the median 22, the mean 25.51 and the maximum 55; there are also quartiles you can use, with the 1st quartile at 21 and the 3rd quartile at 27. Exam points range from a minimum of 7 to a maximum of 33, with a median of 23, a mean of 22.72, a 1st quartile of 19 and a 3rd quartile of 27.75. Global attitude has a minimum of 14, a median of 32, a mean of 31.43, a maximum of 50, a 1st quartile of 26 and a 3rd quartile of 37. Deep learning scores have a minimum of 1.58, a median of 3.667, a mean of 3.68 and a 3rd quartile of 4.083, and so on. Not everybody likes these kinds of tables, but they give me quick insight into my data and I use them quite often.

Use the packages GGally and ggplot2 to get a better graphical overview

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

p <- ggpairs(students2014, mapping = aes(alpha=0.5), lower = list(combo = wrap("facethist", bins = 20)))

p

This way the data is more easily readable. Almost all the variables are approximately normally distributed; age is skewed. The plot also shows that there is no very strong correlation between the variables, since the correlations are between -0.3 and 0.4.

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.

model <- lm(Points ~ gender + Attitude + stra, data = students2014)
summary(model)
## 
## Call:
## lm(formula = Points ~ gender + Attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7179  -3.3285   0.5343   3.7412  10.9007 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97982    2.40303   3.737 0.000258 ***
## genderM     -0.22362    0.92477  -0.242 0.809231    
## Attitude     0.35100    0.05956   5.893 2.13e-08 ***
## stra         0.89107    0.54409   1.638 0.103419    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared:  0.2051, Adjusted R-squared:  0.1904 
## F-statistic: 13.93 on 3 and 162 DF,  p-value: 3.982e-08

This model studied the association of exam points (the target variable, as asked) with gender, attitude and strategic learning (the explanatory variables). It can be seen that Attitude is the only explanatory variable that is statistically significant (p < 0.05); in fact its p-value is very low.

Next we use the summary of the fitted model to explain the relationship between the chosen explanatory variable and the target variable.

I fit a second regression model with the one significant explanatory variable, Attitude.

model_2 <- lm(Points ~ Attitude, data = students2014)
summary(model_2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

These results mean that the estimated coefficient for attitude is 0.35 and the p-value remains below 0.05, i.e. statistically significant. In other words, when attitude increases by one unit, the exam points increase by 0.35 on average.
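To make this interpretation concrete, we can compute predicted exam points at two attitude values using the coefficients from the summary above (values copied from the output, so this is an illustrative sketch, not a re-fit of the model):

```r
# Coefficients copied from summary(model_2) above
intercept <- 11.63715
slope <- 0.35255

# Predicted exam points for attitude scores 30 and 31
pred_30 <- intercept + slope * 30
pred_31 <- intercept + slope * 31

pred_31 - pred_30   # the difference equals the slope, 0.35255
```

So a one-unit increase in attitude moves the prediction up by exactly the slope, regardless of the starting attitude value.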

Explaining the multiple R-squared of the model.

R-squared is a statistic showing how close the data are to the fitted regression line, i.e. how well the model fits my data. In general, the higher the R-squared, the better the model fits the data. R-squared is defined as the percentage of the variation in the response variable that is explained by the linear model: R-squared = explained variation / total variation, so it is always between 0 and 100%. In this summary the multiple R-squared is 0.1906, so the model explains only about 19% (roughly a fifth) of the variation in exam points around their mean. R-squared does not determine whether the coefficient estimates and predictions are biased, which is why you must also assess the residual plots.
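As a sanity check of the definition, R-squared can be computed by hand as one minus the residual variation over the total variation. A minimal sketch on simulated data (the variables x and y here are made up for illustration):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

fit <- lm(y ~ x)

# R-squared = 1 - residual sum of squares / total sum of squares
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
r2_manual <- 1 - rss / tss

all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```

The manually computed value matches the multiple R-squared reported by summary().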

Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

There are several assumptions behind the linear regression model. With the help of these plots you can analyze the residuals of the model and see how well the model is working, or whether there are problems with it.

plot(model_2, which = c(2,1,5))

Residuals vs Fitted values

This plot shows that the residuals have constant variance: they are spread roughly equally around the horizontal line without distinct patterns.

Q-Q plot (normality)

With the Q-Q plot you can check whether the residuals are normally distributed. As you can see, the points are very close to the line; the upper and lower tails show some deviation, but I think this is acceptable. I would interpret the errors as normally distributed.

Residuals vs Leverage

This plot helps you see whether there are outliers in the data that are influential in the linear regression model. In this analysis all the cases are inside the Cook's distance lines.


Week 3, Logistic Regression

Reading the data

This week we learn about logistic regression. Before analysing the data I have done some wrangling, which was not as difficult as the week before. Now the data is ready to be analyzed. The following changes have been made: the variables not used for joining the two datasets have been combined by averaging; the variable 'alc_use' is the average of 'Dalc' and 'Walc', and 'high_use' is TRUE if 'alc_use' is higher than 2 and FALSE otherwise.
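The wrangling itself was done in a separate script, but the key step can be sketched like this on a toy data frame (the column names Dalc and Walc come from the original data; the toy values are made up):

```r
library(dplyr)

# Toy data: weekday (Dalc) and weekend (Walc) alcohol use on a 1-5 scale
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))

toy <- toy %>%
  mutate(alc_use = (Dalc + Walc) / 2,   # average of weekday and weekend use
         high_use = alc_use > 2)        # TRUE if the average is above 2

toy
#   Dalc Walc alc_use high_use
# 1    1    1     1.0    FALSE
# 2    3    4     3.5     TRUE
# 3    5    5     5.0     TRUE
```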

alc <- read.csv("alc2014.csv", header = TRUE, sep = ",")

#Looking at the structure

dim(alc)
## [1] 382  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
summary(alc)
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 

The modified data consists of 382 observations of 35 variables, as seen above. The data describes student achievement in secondary education at two Portuguese schools and was collected via questionnaires and school reports. Originally there were two datasets on performance in two distinct subjects: Mathematics (called mat) and Portuguese language (called por). If you want, you can find more information about the original dataset here: https://archive.ics.uci.edu/ml/datasets/Student+Performance.

The purpose of my analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data.

To do this I chose four variables from the data, and for each of them I present my own hypothesis about its relationship with alcohol consumption. The chosen variables are failures, absences, famrel and higher.

1. The variable "failures" is the number of past class failures, which might be linked to higher alcohol consumption. Common sense suggests these two variables could be correlated, which is why I chose it.
2. The second chosen variable is "absences", which measures school absences. Again, common sense suggests that those who drink more alcohol have more school absences, so I want to see whether there is a correlation.
3. The third chosen variable is "famrel", the quality of family relationships. It is commonly thought that lower social status is connected to higher alcohol consumption; now we will find out whether these two variables are correlated in this data.
4. The fourth chosen variable is "higher", which tells whether the student wants to pursue higher education. It can be argued that those who study more and harder have less time for partying and, for instance, alcohol use, so we will find out whether those who want a higher education consume less alcohol.

Next we explore, numerically and graphically, the distributions of the chosen variables and their relationships with alcohol consumption, using for example cross-tabulations, bar plots and box plots.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

First we explore the relationship between alcohol consumption and number of past class failures

table_1 <- table(high_use = alc$high_use, number_of_past_class_failures = alc$failures)

addmargins(table_1)
##         number_of_past_class_failures
## high_use   0   1   2   3 Sum
##    FALSE 244  12  10   2 268
##    TRUE   90  12   9   3 114
##    Sum   334  24  19   5 382
failurel <- ggplot(alc, aes(x = high_use, y = failures))
failurel + geom_count()

From this table and plot you can see that students with no past class failures are less likely to be high alcohol consumers, while those who consume more alcohol tend to have more class failures.

Second we explore the relationship between alcohol consumption and absences

g <- ggplot(alc, aes(x = high_use, y = absences))

g2 <- g + geom_boxplot() + ggtitle("High alcohol consumption and absences")

g2

From this plot we can see that students who consume less alcohol have fewer school absences, while those who consume more alcohol have more school absences. So there might be a correlation between these two variables.

Third we explore the relationship between alcohol consumption and family relationships

g3 <- ggplot(alc, aes(x = high_use, y = famrel))

g4 <- g3 + geom_boxplot()

g4

Students who consume less alcohol report higher scores for their family relationships, and those who consume more alcohol report lower scores. From this alone you cannot say which causes which: do students start using alcohol because of bad family relationships, or do family relationships get worse when a student starts using more alcohol?

Fourth we explore the relationship between aspirations for higher education and alcohol consumption

table_2 <- table(high_use = alc$high_use, wants_high_education = alc$higher)

table_2
##         wants_high_education
## high_use  no yes
##    FALSE   9 259
##    TRUE    9 105
round(prop.table(table_2) * 100, 1)
##         wants_high_education
## high_use   no  yes
##    FALSE  2.4 67.8
##    TRUE   2.4 27.5

These two tables suggest that students who consume more alcohol are less likely to want a higher education.
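The overall percentages above mix two comparisons, so column-wise proportions may make the pattern clearer: what share of each aspiration group is a high user. A sketch rebuilding the table from the counts shown above:

```r
# Counts copied from table_2 above
table_2 <- matrix(c(9, 9, 259, 105), nrow = 2,
                  dimnames = list(high_use = c("FALSE", "TRUE"),
                                  wants_high_education = c("no", "yes")))

# Proportions within each aspiration column (margin = 2)
round(prop.table(table_2, margin = 2) * 100, 1)
#         wants_high_education
# high_use   no  yes
#    FALSE 50.0 71.2
#    TRUE  50.0 28.8
```

Half of the students without higher-education plans are high users, compared with about 29% of those with such plans, which supports the interpretation.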

Next we use logistic regression to statistically explore the relationship between the chosen variables and the binary high/low alcohol consumption variable as the target variable.

m <- glm(high_use ~ failures + absences + famrel + higher, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + famrel + higher, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3593  -0.7890  -0.6902   1.1782   1.8993  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.12562    0.74528  -0.169 0.866143    
## failures     0.43484    0.19618   2.217 0.026653 *  
## absences     0.08341    0.02272   3.671 0.000241 ***
## famrel      -0.22708    0.12502  -1.816 0.069322 .  
## higheryes   -0.36274    0.55177  -0.657 0.510924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 436.43  on 377  degrees of freedom
## AIC: 446.43
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)    failures    absences      famrel   higheryes 
## -0.12562333  0.43483926  0.08340567 -0.22707832 -0.36273561

As seen above, the p-value is low (<0.05) for failures and absences, and for famrel it is close to 0.05. The coefficients for failures and absences are positive (failures 0.43, absences 0.08), meaning that more failures and more absences predict higher alcohol use. Family relationship quality has a negative coefficient (-0.23), meaning that better family relationships have a protective effect against high alcohol use. Plans for higher education also have a negative coefficient (-0.36), but the effect is not statistically significant.

Next we present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them.

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.1.3     ✔ purrr   0.3.3
## ✔ readr   1.3.1     ✔ stringr 1.4.0
## ✔ tibble  2.1.3     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
OR <- coef(m) %>% exp()
CI <- exp(confint(m))
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 0.8819470 0.1995909 3.776889
## failures    1.5447147 1.0499664 2.282173
## absences    1.0869827 1.0418018 1.139025
## famrel      0.7968584 0.6230747 1.019051
## higheryes   0.6957704 0.2372414 2.121797

From this table you can see the coefficients of the model as odds ratios with their confidence intervals. The odds ratio for failures is 1.54 (confidence interval 1.05-2.28) and for absences 1.09 (confidence interval 1.04-1.14). This means that each additional past failure multiplies the odds of high alcohol consumption by 1.54, and each additional absence multiplies them by 1.09. The odds ratio for family relationship quality is 0.80 with a confidence interval of 0.62-1.02, meaning that each one-unit improvement in family relationships multiplies the odds of high alcohol consumption by 0.8 (although this interval just includes 1). The fourth tested variable, plans for higher education, was not statistically significant: its odds ratio is 0.70 with a confidence interval of 0.24-2.12, and when the number 1 is included in the confidence interval there is no statistically significant difference. These findings can be said to support the earlier hypotheses.

Exploring the predictive power of the model

We now explore the predictive power of the model, using the variables which, according to my logistic regression model, had a statistical relationship with high/low alcohol consumption.

m <- glm(high_use ~ failures + absences + famrel + higher, data = alc, family = "binomial")

probabilities <- predict(m, type = "response")

alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, famrel, high_use, probability, prediction) %>% tail(10)
##     failures absences famrel high_use probability prediction
## 373        1        0      4    FALSE   0.2765114      FALSE
## 374        1        7      5     TRUE   0.3531843      FALSE
## 375        0        1      5    FALSE   0.1764851      FALSE
## 376        0        6      4    FALSE   0.2898242      FALSE
## 377        1        2      5    FALSE   0.2646186      FALSE
## 378        0        2      4    FALSE   0.2262058      FALSE
## 379        2        2      2    FALSE   0.5234763       TRUE
## 380        0        3      1    FALSE   0.3857482      FALSE
## 381        0        4      2     TRUE   0.3523118      FALSE
## 382        0        2      4     TRUE   0.2262058      FALSE
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65445026 0.04712042 0.70157068
##    TRUE  0.25392670 0.04450262 0.29842932
##    Sum   0.90837696 0.09162304 1.00000000
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

g + geom_point()

The training error

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3010471

The training error is 0.30, i.e. 30%: almost a third of the observations are incorrectly classified. I do not know how common this is when building a model, but the number seems quite high, so the model is not performing as well as it should.
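To see why the loss function counts misclassifications, note that for a logical class (TRUE = 1, FALSE = 0) the absolute difference from the predicted probability exceeds 0.5 exactly when the 0.5-threshold prediction is wrong. A tiny made-up example:

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE where the prediction misses
  mean(n_wrong)                       # share of misclassified observations
}

# Third case: true class TRUE but probability only 0.4 -> misclassified
loss_func(class = c(TRUE, FALSE, TRUE), prob = c(0.9, 0.2, 0.4))
# [1] 0.3333333
```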

Perform 10-fold cross validation on this model

library(boot)

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

cv$delta[1]
## [1] 0.3036649

Even after 10-fold cross-validation the result remains about the same (0.304 vs 0.301), so roughly 30% of observations are incorrectly classified.

Week 4, Exercise 4, Clustering and classification

This week we are learning how to cluster and classify. First we need to load the Boston data from the MASS package.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

#Then we explore the structure and the dimensions of the data

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The data has 506 observations and 14 variables. This dataset is about housing values in suburbs of Boston. The variable names are abbreviated, so I have explained them below.

crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes in $1000s

Then we are exploring the data by doing a graphical overview and by showing summaries of the variables.

First I load a few packages I may need during the exercise.

library(ggplot2)
library(GGally)
library(tidyverse)


If you look at the variables more closely, most of them look roughly normally distributed. One quick check is whether the mean and median are close to each other, which they are for most variables. The variables zn and crim are not normally distributed, and chas is a binary variable.
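The mean-versus-median heuristic mentioned above can also be checked numerically; a sketch over the Boston columns (assuming the MASS package is installed):

```r
library(MASS)
data("Boston")

# Difference between mean and median for each variable:
# a large positive gap suggests right skew
sort(round(sapply(Boston, function(x) mean(x) - median(x)), 2))
```

The gaps are largest for zn and crim, matching the impression from the summary table.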

Next we are going to look at the correlations in the dataset.

#First a picture of the graphical overview

pairs(Boston)

This picture is not very clear. It is a pairs plot, but all the points are so small and close together that you mostly see black. You cannot see the correlations from this picture, so next we are going to look at the correlations more closely.

#Let's make a correlation matrix and draw a correlation plot

cormatrix <- cor(Boston)
cormatrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
library(corrplot)
## corrplot 0.84 loaded
corrplot(cormatrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

My R couldn’t find the corrplot library at first, even though it was used in the DataCamp exercises; after installing the corrplot package I could look at the correlations in this picture, which is much clearer to me. There is a strong negative correlation (a big red ball) between dis-nox, dis-age and dis-indus: moving further from the Boston employment centres, the nitrogen oxide concentration goes down and the proportion of owner-occupied units built prior to 1940 goes down. This seems clear and logical. Also lstat (percentage of lower-status population) and medv (median value of owner-occupied homes) have a strong negative correlation: when the percentage of lower-status population gets bigger, the median value of owner-occupied homes in $1000s gets smaller. This is also logical. A positive correlation is marked with a big blue ball; looking at the picture, rad and tax have a strong positive correlation, meaning that when the index of accessibility to radial highways rises, the full-value property-tax rate per $10,000 also rises.
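The strongest pairs can also be listed programmatically instead of reading them off the plot. A sketch (the 0.7 cutoff is an arbitrary choice of mine):

```r
library(MASS)
data("Boston")
cormatrix <- cor(Boston)

# Keep only the upper triangle so each pair appears once,
# then list the pairs whose |r| exceeds the (arbitrary) 0.7 threshold.
cormatrix[lower.tri(cormatrix, diag = TRUE)] <- NA
strong <- which(abs(cormatrix) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cormatrix)[strong[, 1]],
           var2 = colnames(cormatrix)[strong[, 2]],
           r    = round(cormatrix[strong], 2))
```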

Standardize the dataset and print out summaries of the scaled data

Standardizing the dataset

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# boston_scaled is a matrix, so we convert it to a data frame for later use
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)

After the data has been scaled you can see from the summary that all the means are exactly zero and the medians are close to them: scaling centres each variable to mean 0 and standard deviation 1, which makes the variables comparable in the fitted model (it does not make them normally distributed, though). Next we are going to change the continuous crime rate variable into a categorical variable. We cut the crim variable by quantiles to get the low, middle and high rates of crime into their own categories. Then we drop the old crim variable from the dataset and replace it with the new crime variable.
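For reference, `scale()` simply subtracts each column’s mean and divides by its standard deviation. A sketch verifying this by hand for one column:

```r
library(MASS)
data("Boston")

# scale(x) computes (x - mean(x)) / sd(x) column by column.
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- scale(Boston)[, "crim"]
all.equal(as.numeric(scaled), manual)  # TRUE
```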

# Creating a quantile vector

bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# creating a categorical variable crime

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
library(dplyr)

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865
dim(boston_scaled)
## [1] 506  14

The data still has 506 observations and 14 variables. Following the instructions, we next divide the dataset into train and test sets so that 80% of the observations belong to the train set.

# Divide the data: a random 80% of the observations go to the train set,
# the rest to the test set.
n <- nrow(boston_scaled)
n
## [1] 506
ind <- sample(n, size = n * 0.8)
# ind is a plain vector, so dim() returns NULL; length(ind) would give its size
dim(ind)
## NULL
train <- boston_scaled[ind, ]
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  -0.4872 -0.4872 -0.4872 0.0487 0.7133 ...
##  $ indus  : num  1.015 -0.1803 -0.0797 -0.4762 0.569 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  1.5991 -0.0923 -0.5669 -0.2649 -0.7827 ...
##  $ rm     : num  0.251 -0.244 -0.56 -0.388 0.224 ...
##  $ age    : num  0.8784 -0.3473 -1.6439 -0.0702 -0.532 ...
##  $ dis    : num  -0.8512 0.0982 0.0714 0.8384 -0.0613 ...
##  $ rad    : num  1.66 -0.637 -0.637 -0.522 -0.637 ...
##  $ tax    : num  1.529 -0.618 -0.779 -0.577 -0.82 ...
##  $ ptratio: num  0.8058 -0.0257 0.0667 -1.5037 -0.118 ...
##  $ black  : num  -3.606 0.433 0.441 0.426 0.42 ...
##  $ lstat  : num  0.7558 0.0108 -0.2497 -0.0312 -0.6292 ...
##  $ medv   : num  -1.40618 -0.16666 0.00731 0.03992 0.03992 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 4 3 2 2 1 3 1 2 3 2 ...

So now the train set has 404 observations and 14 variables, which is 80% of the 506 observations we began with.
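Note that `sample()` draws a different split on every run. If the split needs to be reproducible, a seed can be set first; a sketch (the seed value 2017 is arbitrary):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(2017)                      # arbitrary seed; makes the split repeatable
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)    # row indices for the train set
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
c(nrow(train), nrow(test))          # 404 102
```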

Next we are creating the test set.

test <- boston_scaled[-ind, ]
str(test)
## 'data.frame':    102 obs. of  14 variables:
##  $ zn     : num  -0.4872 -0.4872 -0.4872 0.0487 0.0487 ...
##  $ indus  : num  -0.593 -1.306 -1.306 -0.476 -0.476 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.74 -0.834 -0.834 -0.265 -0.265 ...
##  $ rm     : num  1.281 1.227 0.207 -0.93 0.131 ...
##  $ age    : num  -0.266 -0.511 -0.351 1.116 0.914 ...
##  $ dis    : num  0.557 1.077 1.077 1.086 1.212 ...
##  $ rad    : num  -0.867 -0.752 -0.752 -0.522 -0.522 ...
##  $ tax    : num  -0.986 -1.105 -1.105 -0.577 -0.577 ...
##  $ ptratio: num  -0.303 0.113 0.113 -1.504 -1.504 ...
##  $ black  : num  0.396 0.441 0.41 0.328 0.393 ...
##  $ lstat  : num  -1.21 -1.03 -1.04 2.42 1.09 ...
##  $ medv   : num  1.323 1.486 0.671 -0.656 -0.819 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 2 2 3 3 3 3 1 ...

The test set has 102 observations and 14 variables. This is 20% of the original data set.

Next we are going to fit a linear discriminant analysis on the train set

We use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. We will draw the LDA (bi)plot of the model.

# Fitting the linear discriminant analysis on the train set

library(MASS)   # lda() comes from the MASS package

lda.fit <- lda(formula = crime ~ ., data = train)

# A helper function that adds arrows for the LDA coefficients to the biplot
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
#Target classes as numeric 

classes <- as.numeric(train$crime)

#Drawing a plot of the lda results

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

#Next we are going to save the crime categories from the test set and then remove the categorical crime variable from the test dataset. After that we are going to predict the classes with the LDA model on the test data.

#We are going to save the crime categories from the test set

correct_classes <- test$crime

library(dplyr)

#Next we are going to remove the categorial crime variable from the test dataset

test <- dplyr::select(test, -crime)

# Next we are going to predict with the LDA model on the test data

lda.pred <- predict(lda.fit, newdata = test)

# Cross-tabulate the correct classes against the predicted classes

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low        9      13        2    0
##   med_low    4      14        7    0
##   med_high   0       1       18    0
##   high       0       0        0   34

Here you can see how the model performs on the predictions. The model works well when predicting the high crime rate class (all 34 high observations are classified correctly), but it makes more errors in the other crime classes, especially between low and med_low.
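The overall accuracy on the test set can be read off the cross table as the proportion of predictions on the diagonal. A sketch using the values from the table above (this particular split is random, so the numbers vary between runs):

```r
# Confusion table values copied from the output above.
conf <- matrix(c(9, 13,  2,  0,
                 4, 14,  7,  0,
                 0,  1, 18,  0,
                 0,  0,  0, 34),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# Accuracy = correctly classified / all test observations.
sum(diag(conf)) / sum(conf)  # 75 / 102, about 0.74
```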

# Next we are going to load the Boston dataset again and standardize it. We
# calculate the distances between the observations, run the k-means algorithm
# on the dataset, investigate the optimal number of clusters and run the
# algorithm again. In the end we visualize the clusters and interpret the
# results.

library(MASS)
data("Boston")
dim(Boston)
## [1] 506  14

We have now loaded the Boston dataset again from the MASS library and checked that it still has the correct number of observations (506) and variables (14).

Next we are going to measure the distances. I am going to use the Euclidean distance, which is probably the most common one. K-means is an old and widely used clustering method: it computes the distances automatically, but you have to choose the number of clusters yourself. I first made the model with 3 clusters, because in my opinion it worked better than 4 clusters.
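As a sanity check, the Euclidean distance that `dist()` computes is just the square root of the sum of squared coordinate differences. A sketch comparing `dist()` to the formula for the first two rows:

```r
library(MASS)
data("Boston")
scaled <- scale(Boston)   # matrix of standardized variables

# Euclidean distance between rows 1 and 2 by the formula...
d_manual <- sqrt(sum((scaled[1, ] - scaled[2, ])^2))

# ...and as computed by dist(), whose default method is "euclidean".
d_dist <- as.matrix(dist(scaled))[1, 2]
all.equal(d_manual, d_dist)  # TRUE
```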

scale_Boston2 <- scale(Boston)
scale_Boston2 <- as.data.frame(scale_Boston2)

# Next we calculate the Euclidean distances between the observations

dist_eu <- dist(scale_Boston2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# K-means clustering (note: kmeans uses random starting centers, so calling
# set.seed() first would make the result reproducible)

km <- kmeans(scale_Boston2, centers = 3)
#Plotting the Boston dataset with clusters
pairs(scale_Boston2, col = km$cluster)

pairs(scale_Boston2[1:6], col = km$cluster)

pairs(scale_Boston2[7:13], col = km$cluster)

Next we are going to investigate the optimal number of clusters. There are many ways to find it, but we will use the total within-cluster sum of squares (TWCSS) and visualise the result with a plot.

# First determine the number of clusters

k_max <- 10

# Calculate the total within sum of squares

twcss <- sapply(1:k_max, function(k){kmeans(scale_Boston2, k)$tot.withinss})

# Next we are going to visualize the results

library(ggplot2)   # qplot() comes from ggplot2

qplot(x = 1:k_max, y = twcss, geom = "line")

The optimal number of clusters is where the total WCSS drops most radically. As you can see from the plot above, this happens around k = 2, so the optimal number of clusters would be 2. Next we run the algorithm again with two clusters.

km <- kmeans(scale_Boston2, centers = 2)

pairs(scale_Boston2, col = km$cluster)

In this first picture all the variables are included. Because there are so many variables, I think the picture is quite difficult to interpret and understand. That is why I chose to draw two more plots, so that I can look at the effects more closely.

pairs(scale_Boston2[1:8], col = km$cluster)

pairs(scale_Boston2[6:13], col = km$cluster)

As you can see from the pictures above, the variable “chas” doesn’t follow any pattern with any of the other variables. These kinds of plots are hard for me to interpret because I am doing this for the very first time. I think, however, that there might be a negative correlation between indus-dis, nox-dis and dis-lstat, and a positive correlation between indus-nox, age-nox and age-lstat.